Project Overview


Moderately priced, scalable, high-performance
connectionist computation


•  Tool  for  Artificial  Neural  Network  Research 

-  Training Large Nets (up to 10^9 parameters)

-  Experiments with Real-World I/O (new work in speech and vision)

•  Research  &  Education  in  Parallel  Architectures,  VLSI,  Connectionist 
Software. 


Benchmark  Problem 

Evaluate  a  network  with  a  million  units  and  an  average  of  a  thousand 
connections  per  unit  for  a  total  of  a  billion  connections.  This  should  be 
done  100  times  per  second. 
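
The throughput this benchmark implies can be worked out directly. The sketch below only restates the figures above as arithmetic; the function and variable names are illustrative, and it assumes each connection costs one evaluation per network pass.

```python
# Benchmark figures quoted in the text above.
UNITS = 1_000_000
CONNECTIONS_PER_UNIT = 1_000       # average fan-in
EVALUATIONS_PER_SECOND = 100

def required_connections_per_second(units, fan_in, rate_hz):
    """Connection evaluations needed per second (assumes one per connection per pass)."""
    return units * fan_in * rate_hz

total_connections = UNITS * CONNECTIONS_PER_UNIT
required_rate = required_connections_per_second(
    UNITS, CONNECTIONS_PER_UNIT, EVALUATIONS_PER_SECOND
)

print(f"total connections: {total_connections:,}")
print(f"required rate:     {required_rate:,} connections/s")
```

Under these assumptions the target is 10^11 connection evaluations per second.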


Low-Moderate  Precision  Arithmetic 


•  Fast Digital Multipliers:
O(N^2) area.

•  FP  makes  it  worse. 

•  High-precision arithmetic:
high operand bandwidth.
Caches

•  Full  precision  or  FP  not  needed  for  a  wide  class  of 
problems 

-  Example: NNs for speech need only 16b values and 8b
weights.


We  use  16x16  bit  multiplies  with  32b  accumulates. 
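
A minimal sketch of what a 16x16-bit multiply with a 32-bit accumulate looks like in software. The helper names are hypothetical, and saturating the accumulator on overflow is an assumption made here for illustration; the slides do not say whether the hardware saturates or wraps.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def check_int16(x):
    """Reject operands that would not fit a signed 16-bit multiplier input."""
    if not INT16_MIN <= x <= INT16_MAX:
        raise ValueError(f"{x} does not fit in 16 bits")
    return x

def saturate32(x):
    """Clamp to the signed 32-bit range (assumed overflow behaviour)."""
    return max(INT32_MIN, min(INT32_MAX, x))

def dot_16x16_acc32(weights, activations):
    """Fixed-point dot product: 16b x 16b products summed in a 32b accumulator."""
    acc = 0
    for w, a in zip(weights, activations):
        # A single 16x16 product has magnitude at most 2^30, so it always fits in 32 bits.
        product = check_int16(w) * check_int16(a)
        acc = saturate32(acc + product)
    return acc

print(dot_16x16_acc32([1, 2, 3], [4, 5, 6]))
```

A single product never overflows 32 bits, but a long sum of large products can, which is why the accumulator width and overflow policy matter more than the multiplier width.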


Processor  Organization 


How  does  one  organize  many  small  arithmetic  units? 

Based  on  our  experience: 

1.  Amdahl’s  Law  applies 

•  Need a well-integrated general-purpose processor.

A key issue is finding the right balance of special-purpose
multiply/add resources and general-purpose processing.

2.  Hardware  should  support  software  and  not  vice  versa. 
Vector Processing


•  Vector  instruction  set  architecture  (ISA)  provides 
a  simple  abstraction  of  highly  regular  parallelism. 

•  Many  arithmetic  operations  specified  in  a  single 
instruction: 


[ a1 a2 a3 ... an ]
        OP
[ b1 b2 b3 ... bn ]
        ↓
[ c1 c2 c3 ... cn ]


[ a1 a2 a3 ... an ]
        OP
        s
        ↓
[ c1 c2 c3 ... cn ]


•  Well  established  paradigm 

-  Well  understood  compiler  technology 

—  Many  existing  algorithms 

•  A  range  of  implementation  costs  and  performance  are 
possible  (without  changing  ISA). 
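
The two instruction forms described above (vector OP vector, and vector OP scalar) can be modelled in a few lines. This is only a semantic sketch in Python, not T0 code; `vector_vector` and `vector_scalar` are names invented for illustration.

```python
import operator

def vector_vector(op, a, b):
    """c[i] = a[i] OP b[i] for every element, as one logical instruction."""
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    return [op(x, y) for x, y in zip(a, b)]

def vector_scalar(op, a, s):
    """c[i] = a[i] OP s: the scalar is applied to every element."""
    return [op(x, s) for x in a]

print(vector_vector(operator.add, [1, 2, 3], [10, 20, 30]))
print(vector_scalar(operator.mul, [1, 2, 3], 2))
```

Because the element operations are independent, an implementation is free to run them one at a time or many in parallel without changing the result, which is what allows a range of cost and performance points under one ISA.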


T0 Processor


[Block diagram: vector registers (32 elements), "scalar unit"]


•  Vector load/store architecture

•  Two  vector  arithmetic  datapaths,  each  8-way  parallel 

•  Maximum  vector  length  of  32  elements 

•  Fixed-point and integer arithmetic (FP by emulation)

•  Scalar  Unit  executes  MIPS-II  instruction  set 

—  general  scalar  computations  and 

—  address  generation  and  control  for  vector  units 

•  Single  instruction  issue 

•  Wide  memory  bus  with  various  load/store  options 
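
With a maximum vector length of 32, longer arrays have to be processed in strips. The sketch below shows the usual strip-mining pattern; the function names are hypothetical, and the cycle estimate is an idealized assumption (elements divided by 8 lanes, ignoring startup and memory effects) rather than a measured T0 figure.

```python
MAX_VECTOR_LENGTH = 32   # from the slide
LANES = 8                # each arithmetic datapath is 8-way parallel

def strips(n, mvl=MAX_VECTOR_LENGTH):
    """Yield (start, length) pairs that cover n elements in strips of at most mvl."""
    start = 0
    while start < n:
        vl = min(mvl, n - start)
        yield start, vl
        start += vl

def datapath_cycles(vl, lanes=LANES):
    """Idealized cycles for one vector instruction of length vl (assumption)."""
    return -(-vl // lanes)   # ceiling division

def strip_mined_add(a, b):
    """Element-wise add of two equal-length lists, one strip per 'vector instruction'."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    out = [0] * len(a)
    for start, vl in strips(len(a)):
        end = start + vl
        out[start:end] = [x + y for x, y in zip(a[start:end], b[start:end])]
    return out

print(list(strips(70)))
print(datapath_cycles(32))
```

The scalar unit's role in "address generation and control" corresponds to the loop in `strips`: it computes each strip's start address and length, and the vector unit does the element work.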


T0 System Software


Vector  Instructions  are  implemented  as  MIPS 
“coprocessor”  instructions  => 

•  Standard  MIPS  programming  tools  directly  usable: 

—  C  compiler 

-  assembler 

-  debugger  (gdb) 

-  system  library  (standard  I/O  routines) 

•  Handcrafted  vector  libraries  for  common  operations 

•  Cycle-accurate RTL simulator (1100 cycles/sec on
SPARC 10/51)

•  Fast  ISA  simulator  (100-500K  cycles/sec) 

•  FP  emulation  for  scalar  unit 

•  Yet  to  come: 

-  optimized  code  scheduler 

-  vectorizing  compiler 

—  FP  emulation  for  vector  units 

-  Fixed  point  to  FP  analyzer 


T0 VLSI Implementation


•  MOSIS/HP CMOS26B (1.0 µm), 50 MHz

•  800K  transistors 

•  800M  arithmetic  operations  per  second  (vector  units 
only) 

•  400M  operands/s  (800MB/sec)  memory  bandwidth 
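
The peak figures above are consistent with the datapath description given earlier (two vector arithmetic datapaths, each 8-way parallel, at 50 MHz). The following is a back-of-the-envelope check; the bytes-per-operand and bus-width values are inferred here from the quoted numbers, not stated on the slides.

```python
CLOCK_HZ = 50_000_000
DATAPATHS = 2
LANES_PER_DATAPATH = 8

# Peak arithmetic rate: one operation per lane per cycle.
peak_ops_per_second = CLOCK_HZ * DATAPATHS * LANES_PER_DATAPATH

# Memory figures quoted on the slide.
OPERANDS_PER_SECOND = 400_000_000
BYTES_PER_SECOND = 800_000_000

# Inferred quantities (derived, not stated in the source).
bytes_per_operand = BYTES_PER_SECOND // OPERANDS_PER_SECOND
operands_per_cycle = OPERANDS_PER_SECOND // CLOCK_HZ
bus_bits_per_cycle = operands_per_cycle * bytes_per_operand * 8
ops_per_memory_operand = peak_ops_per_second / OPERANDS_PER_SECOND

print(peak_ops_per_second, bytes_per_operand, operands_per_cycle,
      bus_bits_per_cycle, ops_per_memory_operand)
```

On these numbers, sustaining peak arithmetic requires about two operations per operand fetched from memory, so operand reuse in the vector registers matters.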


T0 Performance


•  Neural Network Computations (16b weights, 8b activations)

-  360 MCPS forward pass

-  100 MCUPS backprop training

•  MPEG decoding (160 x 128 frame):

-  iDCT: 1.02 ms

-  (SPARC 2 implementation: 21.07 ms)
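
To put these numbers in context, the sketch below computes the iDCT speedup and relates the forward-pass rate to the benchmark problem stated earlier (10^9 connections, 100 evaluations per second). It reads MCPS as millions of connections per second and assumes ideal linear scaling across processors, which real multi-node systems will not achieve exactly.

```python
import math

FORWARD_CPS = 360e6              # 360 MCPS forward pass
IDCT_T0_MS = 1.02
IDCT_SPARC2_MS = 21.07

BENCHMARK_CONNECTIONS = 1e9      # from the benchmark problem
BENCHMARK_RATE_HZ = 100

idct_speedup = IDCT_SPARC2_MS / IDCT_T0_MS

# Time for one processor to evaluate the benchmark network once.
seconds_per_pass = BENCHMARK_CONNECTIONS / FORWARD_CPS

# Processors needed for 100 passes/s under ideal scaling (assumption).
processors_needed = math.ceil(
    BENCHMARK_CONNECTIONS * BENCHMARK_RATE_HZ / FORWARD_CPS
)

print(f"iDCT speedup:       {idct_speedup:.1f}x")
print(f"seconds per pass:   {seconds_per_pass:.2f}")
print(f"processors (ideal): {processors_needed}")
```

Even under ideal scaling, a single processor is well over two orders of magnitude short of the benchmark target.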


T0 for Vision


[Figures: Horizontal Gradient Calculation; Disparity Calculation]


Performance*

Task         Sparc-10 (C++)         T0                       T0 Resource Utilization  T0 Performance
-----------  ---------------------  -----------------------  -----------------------  -------------------
Convolution  2.6 sec (5x7 kernel)   0.0172 sec (8x8 kernel)                           arith: 500 MOPS/sec
                                                                                      mem:   450 MB/sec

Disparity    10.08 sec ("focused")  0.1671 sec ("raw")       .20, .36, P: .83         arith: 224 MOPS/sec
                                    0.0668 sec ("focused")                            mem:   664 MB/sec

* on 640 x 480 images


Spert Board


[Figure: board layout showing SBus connectors]


•  SBus (Sun workstation) board

•  33  MBytes/s  input  /  40  MBytes/s  output  to  Xilinx 
(DMA) 

•  8 MBytes/s I/O via SBus programmed I/O (measured)

•  20 MBytes/s I/O via SBus DMA (estimate)

•  8 MBytes fast local SRAM




Multiple  Node  Systems 


•  T1: node in MIMD massively parallel processor

-  network interface for 2-D mesh connections

-  support for efficient message passing

-  scalable from 1 to 1K nodes
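
For a sense of what a 2-D mesh of up to 1K nodes implies for message passing, here is a small model. The 32 x 32 layout, the absence of wraparound links, and dimension-order (X then Y) routing are all assumptions chosen for illustration; the slides do not specify the T1 topology details or routing scheme.

```python
def mesh_neighbors(x, y, width, height):
    """Directly connected nodes of (x, y) in a mesh without wraparound."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < width and 0 <= j < height]

def hops(src, dst):
    """Minimum link traversals between two nodes (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def xy_route(src, dst):
    """Dimension-order route: move along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

WIDTH = HEIGHT = 32              # 32 * 32 = 1024 nodes (assumed square layout)
print(WIDTH * HEIGHT)
print(mesh_neighbors(0, 0, WIDTH, HEIGHT))
print(hops((0, 0), (WIDTH - 1, HEIGHT - 1)))
```

Each node needs at most four links regardless of system size, which is what lets the same node design scale from 1 node to a large array; the cost is that worst-case distance grows with the mesh side length.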


Hydrant I/O Interface


[Figure: Hydrant I/O Interfaces. Legend: T = T1, H = Hydrant, X = Xilinx FPGA]

•  T1 side: 8b link at 250 MBytes/s

•  Xilinx side:

-  independent address/data pins for each direction

-  32b wide at 250 MBytes/s

•  RTL  level  design  complete,  partial  VLSI  layout. 


